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Abstract 

Topological methods, including persistent homology, 
are powerful tools for analysis of high-dimensional 
data sets but these methods rely almost exclusively 
on thresholding techniques. In noisy data sets, 
thresholding does not always allow for the recovery 
of topological information. We present an easy to 
implement, computationally efficient pre-processing 
algorithm to prepare noisy point cloud data sets for 
topological data analysis. The topological de-noising 
algorithm allows for the recovery of topological infor- 
mation that is inaccessible by thresholding methods. 
We apply the algorithm to synthetically-generated 
noisy data sets and show the recovery of topological 
information which is impossible to obtain by thresh- 
olding. We also apply the algorithm to natural image 
data in R 8 and show a very clean recovery of topolog- 
ical information previously only available with large 
amounts of thresholding. Finally, we discuss future 
directions for improving this algorithm using zig-zag 
persistence methods. 

1 Introduction 

An important challenge of data mining is the need 
to generate tools to simplify data down to important 
features or relationships. Topology, the mathematics 
that arises in an attempt to describe global features 
of a space via local data, can provide new insights 
and tools for finding and quantifying relationships 
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particularly on data that is not practical to under- 
stand using standard linear methods. The develop- 
ment of algorithms that compute topological invari- 
ants from realistic data to identify important internal 
relationships is quite recent [3] . One of the most well- 
developed of these algorithms, persistent homology, 
provides qualitative information, such as connected- 
ness and the presence of holes, in point cloud data 
(see [14 for more details) and has been shown to 
be a powerful tool for high-dimensional data analysis 
[7]. (See section [L3| for an introduction to topological 
analysis.) 

1.1 The Need for Topological De- 
Noising 

While topological methods are very useful for data 
analysis, outliers create problems for any topologi- 
cal data analysis method that is applied to a whole 
data set. Thresholding, the technique that filters the 
data based on some parameter, such as density, and 
returns a percentage of the points within a range of 
that parameter, such as only the top 10% densest 
points, has been useful in limiting the effects of noise 
and, currently, topological data analysis relies mostly 
on thresholding techniques. Thresholding, however, 
does not always allow for the recovery of the cor- 
rect topological structure; this is especially true in 
instances were the level of noise is quite high. 

We have encountered a number of real data sets 
generated from experimental scientists from a vari- 
ety of fields that, for external reasons, we believe 
contain non-trivial topological information. The fail- 
ure of thresholding methods, however, to limit the 
effects of noise in these instances has rendered topo- 



logical methods ineffective. A simple example where 
density thresholding does not allow persistent homol- 
ogy to recover the correct topological structure can 
be seen in a noisy sampling of the unit circle in R 2 . 



Section 1.4 is a detailed discussion of this example. 

Standard de-noising or regularization methods, 
such as Tikhonov regularization and the conjugate 
gradient method, often quickly become computation- 
ally inefficient as the dimension of the data increases. 
Principal curves [9 , the generative topographical 
mapping [2] , and graph Laplacian methods for mani- 
fold denoising [10] all address the problem of identify- 
ing a noisily sampled submanifold M in R n . The first 
two methods require the user to know the intrinsic di- 
mension of the submanifold, which can be difficult to 
estimate in the presence of high-dimensional noise. 
Graph Laplacian methods are limited to finding low- 
dimensional submanifolds in R n . 

1.2 Summary of Contributions 

In section [2] we present a computationally tractable 
technique that can take noisy data as its input, pos- 
sibly in high dimensions, and return data that has 
less noise but still retains the underlying topological 
information from the original data set regardless of 
the dimension of the underlying manifold. This is 
followed by a discussion of the intuition associated 
with this algorithm. In section [3] we apply the algo- 
rithm to synthetically-generated noisy data sets and 
show the recovery of topological information which is 
impossible to obtain by thresholding. Section [5] con- 
tains the results of applying this algorithm to natu- 
ral image data in R 8 and shows a very clean recovery 
of topological information previously only available 
with large amounts of thresholding. Finally, we dis- 
cuss future directions for improving this algorithm 
using zig-zag persistence methods. 

1.3 Introduction to Topological Data 
Analysis 

Persistent homology provides information about a 
point cloud data set by computing the homology of a 
sequence of nested spaces built on the data set. Gen- 
erally, we say that the homology of a space captures 



the connectivity, number of loops, and higher dimen- 
sional voids in the space. The homology of a space 
can be represented as Betti numbers, written {/3 n }, 
which gives a count of the n-dimensional voids in the 
space. More specifically, /3q measures the number of 
components in the simplicial complex; fi\ counts the 
number of tunnels, loops that cannot be deformed to 
a point in the complex; and /?2 counts the number of 
voids, enclosed spaces that look like a room. 

For example, if X is the unit circle in the plane, 
its non-zero Betti numbers are /3q = Pi = 1 since X 
is connected and has a non-trivial loop that cannot 
be deformed to a point while staying in the space. If 
X is the figure eight, its non-zero Betti number are 
A) = Pi = 2 since it has two non-trivial loops that 
cannot be deformed to a point or to each other. If X 
is the space resulting from removing an open unit ball 
from R 3 , its non-zero Betti numbers are Pq = P2 = 1 
because X is connected and has a "two-dimensional" 
non-trivial loop. If we had removed k disjoint balls 
from R 3 , fa would have been k. 

To compute homology from point cloud data D, 
the algorithm of persistent homology builds a se- 
quence of nested simplicial complexes with the points 
in D being the vertices of the complexes (see [3] 
for more details). Specifically, a simplicial complex 
K is a space built of vertices, edges, triangles, and 
their higher-dimensional counterparts (called sim- 
plices) where any face of a simplex in K is also in K 
and the intersection of any two simplices is the face of 
both of those simplices. Nested simplicial complexes 
are built from a point cloud data as follows. Let {e n } 
be an increasing sequence of positive real numbers. 
Define the simplicial complex K n for each e n so that 
the vertex set is the data D and two points p,q G D 
are connected by an edge in K n if the distance be- 
tween p and q is less than e n . Moreover, any subset 
of m points in D form an m— simplex in K n if all 
the points in the subset are within e n distance. This 
simplicial complex is known as a Vietoris-Rips com- 
plex. It is not hard to see that that if e n < e m then 
K n C K m . 

The homology is computed for each complex and, 
through a special property of topology, we can deter- 
mine which n-dimensional loops persist from one sim- 
plicial complex to the next complex in the sequence. 
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We use this information to generate barcodes to dis- 
play the homology of this sequence of complexes. For 
each dimension, the number of Betti numbers for each 
complex is plotted and if an n-dimensional loop per- 
sists from one complex to the next, a line in the bar- 
code connects the Betti numbers for those complexes. 
Figure [I] is an example of a sequence of simplicial 
complexes built from a point cloud data set sampled 
from a figure eight as well as the corresponding 0- and 
1-dimensional barcode associated to that sequence. 

If a line, say the Betti 1 barcode, persists through 
many of these nested simplicial complexes, then this 
is evidence that a 1-dimensional hole is present in 
the data. For instance, in figure [I] the long bar for 
the Betti barcode indicates that the data can be 
considered as a single connected component (or, in 
other words, a single cluster). Likewise, the two long 
lines for the Betti 1 barcode indicate that there are 
two 1-dimensional holes in the data. 



1.4 



Example 
Fails 



where Thresholding 



As mentioned in section [Ll] thresholding fails to re- 
cover the correct topological structure from a noisy 
sampling of the unit circle in R 2 . We can create 
such a data set by sampling the probability density 
function generated by convolving the delta function 
that has support being the unit circle in R 2 with the 
spherically-symmetric 2-dimensional Gaussian func- 
tion of standard deviation a. More specifically, we 
sample the probability distribution generated by nor- 
malizing the following function: 



Pa(x) 



II 
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where Sc(x) is the delta function that has support 
being the unit circle in R 2 . 

Note: By sampling the probability density func- 
tion, we mean that we create a set of N points via the 
following method: randomly select a point x within 
the support of the function p and then randomly se- 
lect a number between and 1. If p(x) is greater 
than the selected number, x is included in the set; 
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Figure 1: Reading persistence barcodes. The point 
cloud data set here is a sampling of a figure eight. 
A nested sequence of simplicial complexes is built. 
The x-axis on corresponds to the maximal distance 
allowed for including simplices these complexes. The 
number of lines above an x— value is the Betti number 
of the corresponding simplicial complex. Horizontal 
lines in the Betti n barcode indicate n— dimensional 
voids that persist from one complex to the next in the 
sequence. A long line in a Betti n barcode is evidence 
that there is an n— dimensional void in the data. 



otherwise, x is disregarded. Continue until the set 
contains TV points. 

When the standard deviation a on the Gaussian 
function in p(x) is small, the resulting density func- 
tion has a significant crater over the origin (see fig- 
ure [2]). Persistent homology with thresholding recov- 
ers the generating circle from data sampled from such 
a density function. If, on the other hand, the density 
function has a very shallow crater or none at all (such 
as the function in figure [3]), thresholding does not al- 
low for the recovery of the generating circle. 

In the following example we sample from a density 
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Figure 2: Probability density function for the unit 
circle in R 2 with a spherically-symmetric Gaussian 
noise of standard deviation a = 0.5 
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Figure 3: Probability density function for the unit 
circle in R 2 with a spherically-symmetric Gaussian 
noise of standard deviation a = 0.7 
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Figure 4: A noisy sampling of the unit circle in R 2 . 
This point-cloud data set K contains 1000 points 
sampled from the distribution in Figure 1 



Figure 5: An example of thresholding applied to the 
data set K from figure [4j Ten percent densest points 
in K using the 30th nearest neighbor method to es- 
timate density 




Figure 6: An example of thresholding applied to the 
data set K from figure [4j Ten percent densest points 
in K using the 75th nearest neighbor method to es- 
timate density 

function pqj{x) generated with a Gaussian function 
that has standard deviation a = 0.7 (see Figure [3]) 
The point-cloud data set K in figure [4] is a sampling 
of 1000 points from this probability density function. 

Topological analysis of the data set K, with any 
level of density thresholding, fails to recover a per- 
sistent first Betti number of one (refer to section [L3| 
for explanation of Betti numbers and barcodes). For 
thresholding in this paper, we use the nearest neigh- 
bor estimation of the density of a point cloud data 
set X. The nearest neighbor estimator at x G D is 
defined to be inversely proportional to the distance 
from the point x to the k-ih nearest neighbor of x 
(see [12] for more details). Figures [5] and [6] are two 



examples of density thresholding on the data set K 
in figure [4] The data set in figure [5] contains the 10% 
densest points of K when the 30th nearest neighbor 
density estimator is used. The data set in figure [6] 
contains the 10% densest points of K when the 75th 
nearest neighbor density estimator is used. 

Figures [7] and [8] are two examples of persistent ho- 
mology barcodes for Betti 1 of the data sets in fig- 
ures [5] and [6| respectively. These barcodes were gen- 
erated using Vietoris-Rips complexes built on the re- 
spective data sets. One can see that both thresholded 
data sets fail to recover a persistent first Betti num- 
ber of one (which indicates the presence of a circle in 
the data). Again, density thresholding of the data set 
K with a higher percentage of the data set or with 
different values of k used to compute the fc-th nearest 
neighbor density estimates fail to recover the correct 
topology from this data set. Thresholding also fails 
to recover the topology from data sets created from 
similar probability density functions as in figure [2] 
with smaller standard deviations as low as a = 0.5. 

Betti 1 Barcode 
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Figure 7: Persistence barcode for Betti 1 of the 10% 
densest points of K using 30th nearest neighbor den- 
sity estimation 
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Figure 8: Persistence barcode for Betti 1 of the 10% 
densest points of K using 75th nearest neighbor den- 
sity estimation 



2 Topological De-Noising 
Algorithm 

Let XCI d and let D be a finite set of points sam- 
pled from a probability density function on R d based 
around X. In other words, D is a noisy sampling of 
the space X. Let a be an estimate on the standard 
deviation of the probability density function used to 
generate D. 

Let So be a subset of points randomly selected from 
D (for example, So might have 10% of the points in 
D). We will recursively define new data sets S n as 
follows. 

For each n G N, define F n (x) to be the weighted dif- 
ference of the average of spherically-symmetric Gaus- 
sian functions of standard deviation a centered at 
points in D and 5 n , 

1 ^ -\\ X - P \\ 2 UJ ^ -ll*-rll 2 

F ^=m\2^ e 2ct2 -\s~\2^ e 2ct2 

where uo is a small positive number (typically in 
(.1,-5)). 

This algorithm iteratively maximizes the function 
F n (x) by translating the points in S n in the direction 
of the gradient of F n (x). Let set S n +i be the points 
in S n translated in the direction of the gradient of 
F n (x) a distance proportional to |VF n (p)| for each 
p G S n . Namely, 

#n+i = < P + c — \p e S n > 

where M = max(|VFo(p)|), the maximum gradient 

observed in Fq, and c is the maximum distance each 
point is allowed to translate at each step (typically c 
is set to be proportional to the maximum inter-point 
distance in D). 

2.1 Motivation and Intuition 

The algorithm presented above is a modification 
of the mean-shift clustering algorithm presented by 
Fukunaga and Hostetler in [8 . Let 

. 1 -N*-rii 2 
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Note that f(x) is a kernel density estimator of the 
data set D. The mean-shift clustering method it- 
eratively translates the points in D in the direction 
of the gradient of f(x). The results of this cluster- 
ing method is that the points converge to the max- 
ima of the function f(x). In their paper, Fukunaga 
and Hostetler also note that in situations with only 
a small amount of noise, a small number of itera- 
tions of their method can be used as a data filtering 
method to recover the topology of the data (more it- 
erations quickly sends the points to the maxima of 
f(x)). Their algorithm fails to recover the topology 
in significant amounts of noise and quickly converges 
to the maxima of f{x). 

The intuition about how the algorithm in section [2] 
recovers the underlying topology from a noisy data 
set is that the kernel density estimator f(x) contains 
more information that just the maxima of the func- 
tion. We believe that significant topological informa- 
tion from the data set D is found in regions of the 
domain where the density of the points is high, and 
thus the value of the kernel density estimator f(x) is 
relatively high. Moreover, we believe that the regions 
that have high density and the kernel density estima- 
tor f(x) has small gradient are the most promising 
for the recovery of topological information. 

The idea is to fit another kernel density estimator, 
g(x), to f(x) in these regions. The intuition is that 
low gradients on the kernel density function f(x) of- 
ten arise over the space X under Guassian-like noisy 
sampling of X. For example, if a space X has two 
components and is sampled with Guassian noise, the 
resulting kernel density estimator f(x) will have a low 
gradient somewhere over each of the components of 
X even if the level of noise is very high. The gradient 
need not be zero since the distance between the com- 
ponents may be small compared to the level of noise. 
See figure [9] for such an example. By fitting the func- 
tion g{x) to the function f(x) in these regions we 
find the positions of the points S n that could, under 
the estimated noise, generate similar density function 
estimator to the function f(x) in these regions. 

Translating the points in S n in the direction of the 
function F n (x) allows for the recovery of the topology 
of the space X. While the first term of the function 
F n (x) (i.e. f(x)) serves to cluster the points in S n to 



the maxima /(#), the second term in F n (x), ujg n (x), 
serves to repel the points in S n away from each other. 
When uj is a small positive number, the optimization 
of F n (x) is equivalent to fitting g n (x) to f(x) in re- 
gions where the gradient of f(x) is small and the value 
of f(x) is relatively high (in other words, there is a 
significant density of points in D in the region and 
the estimated density there is relatively flat). This is 
because when the gradient of f(x) is large, the sec- 
ond term is relatively small and the points in S n tend 
toward regions of high f(x) values. Within regions 
where f(x) is relatively high and the gradient of f(x) 
is low, the second term ug n (x) balances f(x) and 
allows for the fitting of within these regions. 



Probability Density 
Estimate f{x) 




Original Data 



Figure 9: Intuition example. Above is a one- 
dimensional data set D sampled noisily from the 
points -1 and 1. A kernel density estimate f(x) is 
plotted above. Notice that f(x) has two regions that 
have high density but low gradient. The output S100 
from the de-noising algorithm is shown slightly above 
the data set D. Notice that the output is gathered 
over the regions of f(x) that have high density and 
small gradient. 



3 Results with Synthetically- 
Generated Data 

3.1 Synthetically- Generated Noisy 
Circle 

First we will apply the above algorithm to the data 
set K, the noisy sampling of the unit circle in R 2 , 



presented in section 1.4 Recall that the data set 
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K shown in figure [4] is 1000 points sampled from the 
probability density function shown in figure [3j Recall 
as was shown in section [OJ density thresholding was 
unable to recover the correct first Betti number of 
the circle used to generate K (see figures [7] and §) 
To apply the above algorithm to we need a ran- 
domly selected subset of points So- For this example, 
we chose So to have 100 points randomly selected 



Belli 1 Barcode 



from K (see figure 10 for a plot of So and figure 11 



for the Betti 1 persistence barcode of So). Recall that 
this algorithm has parameters for a the estimated 
standard deviation of the noise, uo the weighting fac- 
tor for F n (x), and c, the maximal distance each point 
is allowed to translate per iteration. Here we chose 



a = 0.6, uj = 0.1 and c = 0.05. Figure [12| shows 6200? 
the output of this algorithm after 200 iterations. Fig- 
ure [13] shows the Betti 1 persistence barcode gener- 
ated from the Vietoris-Rips complex on 6200- Notice 



the clean recovery of the circle in figure 12 as well as 



the persistent first Betti number of one in figure 13 



Very similar persistence barcode results are ob- 
tained by letting a G (0.4,0.6) as well as when the 
number of points in the set So ranges from 75 to 200. 




Figure 10: So is 100 points selected randomly from 
K, the noisy circle data. 



3.2 Synthetically-Generated Noisy 
Sphere 

Next we apply the de-noising algorithm from sec- 
tion [2] to a noisy sampling of the unit sphere in R 3 . 
We generate this data set in a similar fashion to how 
the data set K as generated in |1.1| Specifically, we 



9 1.0 



Figure 11: Betti 1 persistence barcode of So 
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Figure 12: S200, the output of the algorithm from 
section [2] after 200 iterations applied So G K with 
a 0.6, uo = 0.1, and c = 0.05 

sample the density function 



r r -\\v-s\r 
p(x) = // S S 2(y)e 2(0.3)2 d g 

R 3 

where £52 (y) is the delta function that has support 
being the unit sphere in R 3 and the spherically- 
symmetric Gaussian function has standard deviation 
0.3. Let L be the 1000 points sampled from the den- 
sity function p(x). 

Density thresholding on the data set L fails to re- 
cover the underlying sphere. Figures 14 and 15 are 
two examples of Betti 1 



and Betti 2 persistent ho- 
mology barcodes of density thresholding on the data 
set L. The barcode in figure [l4| was generated from 
a data set that contains the 30% densest points of L 
when the 15th nearest neighbor density estimator is 
used. The barcode in figure [15] was generated from 
a data set that contains the 30% densest points of S 
when the 30th nearest neighbor density estimator is 
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Figure 13: Betti 1 persistence barcode of 6200, the 
output from the noisy circle data 

used. These barcodes were generated using lazy wit- 
ness complexes on 100 randomly chosen landmarks 
(see [5] for a detailed discussion on lazy witness com- 
plexes). Notice that neither barcode shows a second Fi S ure 15: Betti 1 and Betti 2 persistence barcode of 

noisy sphere. This is generated from the 30% dens- 
est points of L using 30th nearest neighbor density 
estimation 



1.12 1.26 1.4 



Betti number of one. 

Betti 1 Barcode 
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Figure 14: Betti 1 and Betti 2 persistence barcode of 
noisy sphere. This is generated from the 30% dens- 
est points of L using lbth nearest neighbor density 
estimation 

We can apply the algorithm from section [2] to the 
data set L as follows. Let Sq be a randomly chosen 
100 point subset of the data set L. Run the algo- 
rithm for 200 iterations with a = 0.35 and c = 0.05. 
The resulting data set 6200 was used to create the 
Vietoris-Rips complex that generated the barcodes 
in figure [TBJ Notice the persistent second Betti num- 
ber is clearly one and there is no persistent first Betti 
number. 

3.3 Synthetically-Generated Noisy 
Point 

For the sake of completeness, we have included the 
results of our algorithm applied to a noisy sampling 
of a point in R 2 . All parameters, except the num- 
ber of iterations, were chosen to be those from the 



Figure 16: Betti 1 and Betti 2 persistence barcode of 
£200 , the output from the noisy sphere L 



noisy sampling of the circle presented above in sec- 
The results below are produced with only 



tion 3.1 



50 iterations. After 50 iterations, the algorithm cor- 
rectly clustered all the points to a point. For the 
sake of presentation, we show the results at fewer 
iterations. The original data set and the resulting 
output is shown in figure [T7| The Betti and Betti 



1 barcode outputs for the output from our algorithm 



are in figure 18 



4 Results from Image Process- 
ing Data 

Natural image statistics have attracted much inter- 
est in recent years. In [5], [7], and [6], persistent 
homology was used to understand the local behavior 
of natural images. Instead of viewing the image as a 
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Figure 17: A noisy sampling P of a point in ] 
lowed by the output 650 where a = 0.6 and uo 



I 2 fol- 
= 0.1. 
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Figure 18: Betti and Betti 1 persistence barcode of 



£50 , the output from the noisy point P in figure 17 



whole, they analyzed the structure of high-contrast 
pixel patches. Each of the papers above analyzed the 
data set M that is a collection of 4 • 10 6 '3 by 3' high 
contrast pixel patches obtained from the collection of 
still images gathered by J. H. van Hateren and A. van 
der Schaaf [13] . M is a subset of a larger set M with 
8 • 10 6 images provided by K. Pedersen. 

The space M is obtained by applying the following 
procedure to a subset of images from a still image 
collection (See [11 for more details). 

1. Select an image from the still image collection. 

2. Extract at random 5000 '3 by 3' patches from 



the image. Regard each patch as a vector in 9- 
dimensional space. 

3. Next, for each patch do the following: 

(a) Compute the logarithm of intensity at each 
pixel. 

(b) Subtract an average of all coordinates from 
each coordinate. This produces a new 9- 
vector. 

(c) For this vector of logarithms compute the 
contrast or "D-norm" of the vector. The 
D-norm of a vector x is defined as Vx T Dx, 
where D is a certain positive definite sym- 
metric 9x9 matrix. 

(d) Keep this patch if its D-norm is among the 
top 20 percent of all patches taken from the 
image. 

4. Normalize each of the selected vectors by divid- 
ing by their respective D-norms. This places 
them on the surface of a 7-dimensional ellipsoid. 

5. Perform a change of coordinates so that the 
resulting set lies on the actual 7-dimensional 
sphere in M 8 . 

4.1 Previous Persistence Results with 
Thresholding 

Because the space of patches M is distributed 
throughout the unit disk in R 8 , not just in high- 
density spaces, all previous topological analysis of 
this space has required density thresholding. Fig- 
ure [l9j a Betti 1 persistence barcode on a subset of 
M of size 10 4 , illustrates the inability of persistence 
homology to recover any topological structure with- 
out some kind of filtering or de-noising. One exam- 
ple of this is in [5], where persistence homology was 
used to analyze dense subspaces within the space of 
patches. Working on a randomly chosen subset of M 
containing 5 • 10 4 points, the authors of [5] showed 
that the 30% densest points in this subset, using the 
300th nearest neighbor density estimator, were gath- 
ered around an annulus. They showed that this dense 
annulus can be characterized by the angle of the edge 
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separating a dark region of the pixel patch from a 
light region. 

This annulus can clearly be seen in the plot of the 
first two dimensions of a similarly generated dense 
subset of M in figure [20j Figure |2l] is the Betti 1 
persistence barcode for this data set. This barcode is 
generated using a lazy witness complex built on 100 
randomly chosen landmarks. 

Betti Barcode 
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Figure 19: Betti and 1 persistence barcodes of a 
subset of M of size 10 5 . 




Figure 20: First two coordinates of 30% densest 
points of a 50, 000 point subset of M using 300th 
nearest neighbor density estimator 



olding. Let M be a randomly chosen subset of M 
that contains 1 • 10 5 points and let So be a randomly 
chosen subset of M that contains 500 points. The 
algorithm was applied to M and So with three dif- 
ferent values of a. Namely, when <j = 0.4, 0.5, and 
0.6. Each instance of the algorithm was run for 100 
iterations, c = 0.05 and uo = 0.1. 

Figures 22j 23 , and 24 show the Betti 1 persis- 
tence barcodes generated on the de-noised data sets 
Sfo1) ^ 5 ioo ' 5 > and S ioo°' 6 respectively. Each bar- 
code is generated using a lazy witness complex built 
on 100 randomly chosen landmarks. For reference, 
the plot of the first two coordinates of the dense sub- 
set shown in figure [20} figures [25l [26l 



and [27] are the 

respective plots of the first two coordinates of Sfol) " 4 , 
Sfol) ' 5 , and ^fol) ' 6 - Notice the clear presence of a per- 
sistent first Betti number of one in each denoised data 
set. 



Betti 1 Barcode 



0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 

Figure 22: Betti 1 persistence barcodes of Sioo ap- 
plied to M using (7 — 0.4 
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Figure 23: Persistence barcodes of Sioo applied to M 
using a = 0.5 



Betti 1 Barcode 



0.07 0.14 0.21 0.28 0.35 0.42 0.49 0.56 0.63 0.7 

Figure 21: Betti 1 persistence barcode of 30% densest 
points of a 50,000 point subset of M using 300th 
nearest neighbor density estimator 



Betti 1 Barcode 



0.014 0.028 0.042 0.056 07 0.084 0.098 0.112 0.126 0.14 



4.2 Results of De-Noising Algorithm Figure 2 4: Betti 1 persistence barcodes of Sioo ap- 

We can obtain similar results from the data set M by P lied to M usm S a = °- 6 
using the algorithm from section [2] instead of thresh- 
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Figure 25: First two coordinates of S100 applied to Figure 27: First two coordinates of £100 applied to 
M using a = 0.4 M using a = 0.6 




Figure 26: First two coordinates of 6100 applied to 
M using <j = 0.5 

5 Results from Range Image 
Data 

An optical image has a grayscale value at each pixel, 
whereas a range image pixel contains a distance: the 
distance between the laser scanner and teh nearest 
object in the correct direction. As before with the 
natural image data, we can think of an n x m pixel 
patch as a vector in R nxm and a set of patches as 
a set of points in R nxm . In pQ ? persistent homology 
was used to understand the local behavior of range 
image patches. As was done in the natural image 
work, the data studied was high-contrast range pixel 
patches. Previous work done on range image patches 
in pj] found that 3x3 range patches simply broke 
into clusters without any obvious simple geometry. 
In PQ, the study was extended to larger 5x5 and 



7x7 range patches. 

These patches were normalized in much of the same 
was as the natural image data described in section [5] 
The resulting data set contains 50,000 high-contrast, 
normalized range image patches. 

Similar to the results on the natural image data, 
the authors of pQ showed that the 30% densest points 
in this subset, using the 300th nearest neighbor den- 
sity estimator, were gathered around an annulus. 
They showed that this dense annulus can be charac- 
terized by the angle of the linear gradients between a 
dark region of the pixel patch to a light region. Fig- 
ure [28] contains the Betti and Betti 1 barcodes for 
30% densest points in the 7x7 image range patches 
where 300th nearest neighbor density estimator was 
used. 

Figure [29| contains the results of the algorithm from 
section [2] applied to a randomly chosen 5000 point 
subset of the 7x7 range image patches. These results 
were generated using 100 iterations of the algorithm 
with (7 = 0.5 and uj = 0.1. 

6 Discussion and Future Direc- 
tions 

The algorithm presented in section [2] uses spherically- 
symmetric Gaussian functions to estimate the density 
of the generating probability density functions used 
F n (x). It is likely that similar spherically-symmetric 
kernels such as the Epanechnikov kernel (see page 
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Belli Barcode 



0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.3 
Betti 1 Barcode 

0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.3 

Figure 28: Betti and 1 persistence barcodes the 30% 
densest points in the 7x7 range image patch data. 
Here the 300th nearest neighbor density estimate was 
used. 

Betti 1 Barcode 



0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.3 
Betti 1 Barcode 



0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.3 

Figure 29: Betti and 1 persistence barcode of Sioo 
applied to the 7x7 range image data using a = 0.5 
and uj = 0.1. 

76 of [12] for more details) would produce similar 
results. The use of a spherically- symmetric kernel 
implies that the data and noise are symmetric. 
In certain circumstances, it may be appropriate 
to pre-scale the data to avoid extreme differences 
in the spread in the various coordinate directions [12] . 

A significant benefit of this method is that it 
performs quickly even when the ambient dimension 
is high. Specifically, the runtime analysis of the 
algorithm is d 2 NMn where d is the dimension of the 
ambient Euclidean space which contains data set, 
TV is the number of data points in the data set D, 
M is the number of points in the set So, and n is 
the number of iteration of the algorithm. Moreover, 
the implementation of this algorithm is very similar 
to that of the mean-shift algorithm and hence it 
should not be difficult to implement in a parallel 
programming paradigm. 



6.1 Stopping Criterion, Ill-posed 
Problem, and Zig-Zag Persistence 

As n increases, the data sets S n tend towards the 
same topology as X while decreasing the amount 
of noise. Since the convergence of this algorithm 
depends on the choice of So, it is unclear at which 
n to stop this algorithm. If the researcher has prior 
knowledge about the underlying space X such as its 
dimension, then you can stop when S n has the same 
estimated dimension. In the future we plan to use a 
new method known as zig-zag persistence (see [4]) 
to determine the stopping criterion in the instance 
when there is no prior knowledge about the space X. 

Note also that this algorithm is an ill-posed 
problem in that solutions obtained from applying 
this technique to a different subset S' C D will 
produce different output. While we expect most of 
the output of this algorithm to retain the correct 
topology of D, it is possible that some output will 
not recover the correct topology. 

Zig-zag persistence, which is a new method for 
studying persistence of topological features across a 
family of point-cloud data sets regardless of the di- 
rection of the arrows between these sets, should allow 
us to obtain a measure of confidence in our results as 
well as to determine an appropriate stopping crite- 
rion. Let {5'q}^ 1 be a collection of m randomly cho- 
sen subsets of D. Run the algorithm above on each 
set Sq a fixed number of times, say n (for instance, 
n = 4c wnere ^ i s the largest distance between any 
two points in the original data set D). We then ob- 
tain the following sequence of data sets: 

S^ ^ S^ U S^ S^ S^ U S^ ... S™ 

In the future we plan to use zig-zag persistence to 
correlate topological features across these different 
instances of this de-noising algorithm to give us an 
accurate picture of the topology of the data set D. 
In other words, if the majority of the S l n show a 
second Betti number of 1, zig-zag persistence should 
tell whether persistence homology is detecting the 
same class generating the 2-cycle in each space 
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or whether there are many different 2-cycles each 
observed in different spaces. Moreover, even if not all 
the SnS have converged to the point that persistent 
homology techniques can recover the topological 
features of D, zig-zag persistence should allow us to 
detect features present in any of the S n 's as well as 
knowledge about how these features persist through 
the various 5 n 's. 
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